
Introduction

I chose to look at 2012 presidential campaign contributions for the state of Ohio. Ohio has long been a critical swing state in presidential elections and, according to Wikipedia, has the longest current streak of any swing state of matching the overall election outcome (since 1960). Campaign contributions aren’t necessarily the best (or even a strong) predictor of votes; however, they do give us some idea of voter sentiment. The ability to predict contribution amount could also be of use to presidential candidates on the campaign trail. This data set has contributions at the zipcode level, which, with the help of choropleth maps, will enable us to visualize geographic relationships.

Univariate Plots Section

##  [1] "zipcode"     "candidate"   "name"        "city"        "state"      
##  [6] "employer"    "occupation"  "amount"      "date"        "gender"     
## [11] "party"       "population"  "pcnt_wht"    "pcnt_blk"    "pcnt_asn"   
## [16] "pcnt_hsp"    "percap_incm" "med_rent"    "med_age"
## 'data.frame':    151479 obs. of  19 variables:
##  $ zipcode    : Factor w/ 1037 levels "43001","43002",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ candidate  : Factor w/ 14 levels "Bachmann, Michele",..: 7 8 7 12 12 12 12 7 12 7 ...
##  $ name       : chr  "BAKER, NANCY" "WHITE, TIMOTHY CHRISTOPHER" "BRIGGS, DEAN" "CHAULK, SARAH" ...
##  $ city       : Factor w/ 1167 levels ":POLAND","`LOVELAND",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ state      : Factor w/ 1 level "OH": 1 1 1 1 1 1 1 1 1 1 ...
##  $ employer   : Factor w/ 13507 levels "","(SELF) GREEN LEAF LAWN CARE",..: 1006 1238 3163 10429 5653 5653 9833 1006 1662 1006 ...
##  $ occupation : Factor w/ 6846 levels "","-","100% DISABLED VIETNAM VETERAN",..: 4148 4102 4649 2594 2975 2975 5237 4148 901 4148 ...
##  $ amount     : num  35 250 50 546 125 ...
##  $ date       : Date, format: "2012-06-25" "2011-05-26" ...
##  $ gender     : Factor w/ 2 levels "female","male": 1 2 2 1 2 2 2 1 2 1 ...
##  $ party      : Factor w/ 3 levels "democrat","green",..: 1 3 1 3 3 3 3 1 3 1 ...
##  $ population : num  2295 2295 2295 2295 2295 ...
##  $ pcnt_wht   : num  93 93 93 93 93 93 93 93 93 93 ...
##  $ pcnt_blk   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pcnt_asn   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pcnt_hsp   : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ percap_incm: num  34306 34306 34306 34306 34306 ...
##  $ med_rent   : num  592 592 592 592 592 592 592 592 592 592 ...
##  $ med_age    : num  46.2 46.2 46.2 46.2 46.2 46.2 46.2 46.2 46.2 46.2 ...
##     zipcode                candidate         name          
##  44122  :  2203   Obama, Barack :91286   Length:151479     
##  43214  :  1745   Romney, Mitt  :50672   Class :character  
##  45208  :  1724   Paul, Ron     : 4271   Mode  :character  
##  45243  :  1718   Santorum, Rick: 2012                     
##  43221  :  1694   Gingrich, Newt: 1432                     
##  44118  :  1566   Cain, Herman  :  583                     
##  (Other):140829   (Other)       : 1223                     
##          city        state      
##  CINCINNATI: 18596   OH:151479  
##  COLUMBUS  : 13820              
##  DAYTON    :  5536              
##  CLEVELAND :  4748              
##  TOLEDO    :  3276              
##  AKRON     :  2965              
##  (Other)   :102538              
##                                    employer    
##  RETIRED                               :34269  
##  SELF-EMPLOYED                         :11503  
##  NOT EMPLOYED                          : 8657  
##  INFORMATION REQUESTED PER BEST EFFORTS: 6049  
##  INFORMATION REQUESTED                 : 4217  
##  (Other)                               :86741  
##  NA's                                  :   43  
##                                   occupation        amount       
##  RETIRED                               :38151   Min.   :    0.0  
##  INFORMATION REQUESTED PER BEST EFFORTS: 5778   1st Qu.:   25.0  
##  HOMEMAKER                             : 4674   Median :   50.0  
##  PHYSICIAN                             : 4458   Mean   :  215.5  
##  ATTORNEY                              : 4056   3rd Qu.:  150.0  
##  (Other)                               :94351   Max.   :15000.0  
##  NA's                                  :   11                    
##       date               gender             party         population   
##  Min.   :2011-01-28   female:67157   democrat  :91286   Min.   :    0  
##  1st Qu.:2012-07-05   male  :79580   green     :   21   1st Qu.:16076  
##  Median :2012-09-17   NA's  : 4742   republican:60172   Median :25049  
##  Mean   :2012-08-05                                     Mean   :26596  
##  3rd Qu.:2012-10-17                                     3rd Qu.:35078  
##  Max.   :2012-12-31                                     Max.   :68475  
##                                                         NA's   :989    
##     pcnt_wht         pcnt_blk        pcnt_asn         pcnt_hsp    
##  Min.   :  0.00   Min.   : 0.00   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.: 76.00   1st Qu.: 2.00   1st Qu.: 1.000   1st Qu.: 1.00  
##  Median : 87.00   Median : 4.00   Median : 2.000   Median : 2.00  
##  Mean   : 79.75   Mean   :12.23   Mean   : 2.941   Mean   : 2.84  
##  3rd Qu.: 93.00   3rd Qu.:13.00   3rd Qu.: 4.000   3rd Qu.: 3.00  
##  Max.   :100.00   Max.   :94.00   Max.   :22.000   Max.   :61.00  
##  NA's   :1003     NA's   :1003    NA's   :1003     NA's   :1003   
##   percap_incm       med_rent         med_age     
##  Min.   :  864   Min.   : 213.0   Min.   : 6.80  
##  1st Qu.:23951   1st Qu.: 550.0   1st Qu.:36.30  
##  Median :30291   Median : 643.0   Median :39.50  
##  Mean   :32178   Mean   : 668.5   Mean   :39.25  
##  3rd Qu.:38854   3rd Qu.: 749.0   3rd Qu.:43.10  
##  Max.   :67742   Max.   :1475.0   Max.   :83.50  
##  NA's   :1039    NA's   :1761     NA's   :1008

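The median contribution is $50 but the mean is $215, so the distribution is strongly right-skewed. Most contributions were made by males (79,580 versus 67,157 by females) or by Democrats (91,286 of 151,479).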

We have long-tailed data. There are so many contributions made under $1,000 that it’s hard to see any of the outliers.

A logarithmic transformation of the x-axis reveals something of a log-normal distribution with what could be a mean of $100. Even so, we can see that the distribution is heavier below this mean.
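The effect of the log transformation can be sketched in base R. This is a minimal illustration on synthetic log-normal data, not the actual contribution data; the `meanlog` and `sdlog` values are assumptions chosen only to produce a similarly long-tailed shape.

```r
# Synthetic stand-in for the contribution amounts (NOT the real data).
set.seed(1)
amount <- round(rlnorm(5000, meanlog = log(50), sdlog = 1.5), 2)
amount <- amount[amount > 0]   # log10() needs strictly positive values

# On the raw scale the long tail pulls the mean well above the median.
raw_mean   <- mean(amount)
raw_median <- median(amount)

# Bin on the log10 scale, then draw the histogram.
log_hist <- hist(log10(amount), breaks = 40, plot = FALSE)
plot(log_hist,
     main = "Contribution amount (log10 scale)",
     xlab = "log10(amount in dollars)")
```

On the log10 scale, a value of 2 on the x-axis corresponds to $100, which makes it easier to judge where the bulk of the distribution sits.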

There is a significant group of contributors (4,416) who, despite the long-tailed distribution, contribute $2,500.

Looking closer we can see that there appear to be several discrete values in increments of $50 that people are accustomed to contributing.

Under $60, we can see contributions spaced in intervals of $5.

An overwhelming majority of contributions (93.5%) were made in 2012.

We can see a steady increase in contributions leading up to the election in November.

There seems to be a slight increase in the amount of contributions made toward the end of the month. There is also a peak at about halfway through the month. People might be making contributions immediately after receiving their paychecks.
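A day-of-month tally like the one described here could be produced as follows. This is a sketch on a handful of example dates (the first two appear in the data preview above; the rest are made up for illustration).

```r
# A few example dates; only the first two come from the data preview.
dates <- as.Date(c("2012-06-25", "2011-05-26", "2012-10-31",
                   "2012-10-15", "2012-09-30"))

# Extract the day of the month as an integer (1-31).
day_of_month <- as.integer(format(dates, "%d"))

# Count contributions per day, keeping empty days so the x-axis is complete.
counts <- table(factor(day_of_month, levels = 1:31))

barplot(counts, xlab = "Day of month", ylab = "Number of contributions")
```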

Although there is a noticeable difference between the number of male and female contributions, the gap is not very large: males account for a proportion of 0.54 of the contributions with an identified gender.

Most contributions are made to either Obama or Romney. Using a log scale, we can see the other candidates a little better. A significant number of contributions are made to other candidates, but those candidates are mostly Republican. It would probably be better to use party instead of candidate as a predictor.

The proportion of contributions to Democrats vs. Republicans resembles the proportion in the previous histogram between Obama and Romney. This makes sense, as the number of contributions to other candidates is small in comparison to these two. Also, the number of contributions to the Green party is so small (21 contributions) that we might want to exclude it for simplicity.

It’s important to remember that the remaining demographic variables correspond to the contributor’s zipcode and not to the contributor him/herself.

The distribution of the population of the zipcodes in which contributors live is fairly normal, with a mean of 26,596 and a median of 25,049.

Most contributors live in areas with a high percentage of white ethnicity and a very low percentage of black, asian, or hispanic ethnicity.

The distribution of per capita income of the zipcodes in which contributors live is fairly normal, with most contributors living in zipcodes with a per capita income of about $20,000 - $40,000. The mean is $32,178 and the median is $30,291.

Again, we see a fairly normal distribution of median rent in the locations in which contributors live. Rent is very cheap (median of $643) as compared to California but per capita income is also lower.

The median age of the zipcodes in which contributors live also resembles a normal distribution, with a median of 39.5.

A choropleth map of total contributions by zipcode shows that there are hotspots of contributions. Because the zipcodes are so small in these hotspots we might assume that they are cities.

With an overlay of city location, we can see that total contributions are higher nearest to cities.

A map of total population per zipcode is definitely similar to the map of total contributions but it doesn’t seem like an exact match. It could be that the higher the population in a zipcode, the more contributions are made. There may also be more contributions made from affluent suburbs with higher per capita income regardless of total population.

Univariate Analysis

What is the structure of your dataset?

There are 151,479 instances of campaign contribution in the dataset with 19 features. From the original data set 11 features were kept or derived:

  • candidate
  • name (contributor)
  • city
  • state
  • zipcode
  • employer
  • occupation
  • amount
  • date
  • gender
  • party

Using the zipcode feature, demographic information was added from another data set:

  • total_population
  • percent_white
  • percent_black
  • percent_asian
  • percent_hispanic
  • per_capita_income
  • median_rent
  • median_age
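A join of this kind can be sketched with base R's `merge()`. The two small data frames below are illustrative stand-ins (only the values for zipcode 43001 come from the data preview above; the rest are hypothetical), and the column names follow the abbreviated names shown in the structure output. A left join keeps every contribution and leaves `NA` where a zipcode has no demographic record, which would be one plausible source of the `NA`s in the summary.

```r
# Contributions keyed by zipcode (illustrative rows only).
contrib <- data.frame(
  zipcode = c("43001", "43001", "44122", "99999"),
  amount  = c(35, 250, 50, 100),
  stringsAsFactors = FALSE
)

# Demographics per zipcode; the 44122 values are hypothetical.
demo <- data.frame(
  zipcode     = c("43001", "44122"),
  population  = c(2295, 30000),
  percap_incm = c(34306, 45000),
  stringsAsFactors = FALSE
)

# Left join: keep all contributions, fill unmatched zipcodes with NA.
joined <- merge(contrib, demo, by = "zipcode", all.x = TRUE)
print(joined)
```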

From the original data set, all but name, amount, and date are factors. None of the factors are ordered. Name is a character, amount is numeric, and date is a date object. The 8 demographic features are all numeric.

Other observations:

  • The largest group of contributors by occupation are retirees
  • About 60% of contributions are made by democrats
  • The median contribution amount is $50 and the maximum is $15,000
  • Barack Obama received 60% of contributions
  • Women accounted for 45% of contributors
  • The number of monthly contributions shows an exponential increase as the election approaches

What is/are the main feature(s) of interest in your dataset?

The main features of interest in the data set are amount, gender, party, per capita income, and median age. I would like to see if these factors are correlated with contribution amount. Occupation could be of interest; however, it has too many levels (6,846) to be practical.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I believe that the percent ethnicities, total population, and median rent may be correlated with contribution amount.

Did you create any new variables from existing variables in the dataset?

I created two new variables: one for the gender of the contributor, based upon the first name, and another for the party of the contributor, based upon the candidate that received the contribution. I was unable to programmatically determine gender by first name for 4,742 instances (approx. 3% of the data).
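The two derived variables could be built roughly as follows. This is a sketch, not the code used for the analysis: the lookup tables are tiny hypothetical examples, the pairing of names with candidates is illustrative, and "SMITH, J." is a made-up name included to show how an unmatched first name ends up as `NA`.

```r
# Example contributor names ("LAST, FIRST ...") and recipient candidates.
name      <- c("BAKER, NANCY", "WHITE, TIMOTHY CHRISTOPHER",
               "BRIGGS, DEAN", "SMITH, J.")
candidate <- c("Obama, Barack", "Romney, Mitt", "Obama, Barack", "Paul, Ron")

# Party: map each candidate to a party with a named lookup vector.
party_lookup <- c("Obama, Barack" = "democrat",
                  "Romney, Mitt"  = "republican",
                  "Paul, Ron"     = "republican")
party <- unname(party_lookup[candidate])

# First name: the first whitespace-delimited token after the comma.
first_name <- sub("^[^,]*,[[:space:]]*([^[:space:]]+).*$", "\\1", name)

# Gender: hypothetical first-name lookup; unknown names become NA.
gender_lookup <- c(NANCY = "female", TIMOTHY = "male", DEAN = "male")
gender <- unname(gender_lookup[first_name])

print(data.frame(name, first_name, gender, party))
```

Names that reduce to an initial or that are missing from the lookup cannot be classified, which is the kind of case that leaves gender undetermined.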

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

A log transformation of contribution amount revealed a log-normal distribution. Despite this, we can see in the untransformed distribution that there is a significant group of people who donate the maximum campaign contribution allowed by law (approx. $2,500 per election in the 2012 cycle, consistent with the spike at $2,500 noted earlier). The set also contains Political Action Committee (PAC) data, which have a larger limit (approx. $5,000). I am unsure of the validity of the outliers beyond this amount because of my limited knowledge of campaign finance law. The fact that several negative amounts needed to be corrected to positive leads me to believe that there could be further inaccuracies in the data set.

The data came in a tidy format and did not need to be transformed.

Bivariate Plots Section

##                  amount   population    pcnt_wht     pcnt_blk    pcnt_asn
## amount       1.00000000 -0.058151631  0.03627490 -0.035720759  0.04039765
## population  -0.05815163  1.000000000 -0.06037725 -0.005842785  0.18167597
## pcnt_wht     0.03627490 -0.060377249  1.00000000 -0.970755919 -0.09407256
## pcnt_blk    -0.03572076 -0.005842785 -0.97075592  1.000000000 -0.08506698
## pcnt_asn     0.04039765  0.181675973 -0.09407256 -0.085066978  1.00000000
## pcnt_hsp    -0.03254065  0.184148197 -0.22500887  0.065098125  0.07400635
## percap_incm  0.16022853 -0.030815036  0.25904630 -0.312223470  0.46629216
## med_rent     0.07274805  0.156892519  0.12470890 -0.198731584  0.51243822
## med_age      0.08008519 -0.173767903  0.26874345 -0.186164441 -0.22395102
## time        -0.11039226  0.032613802  0.01492987 -0.019443600  0.01347171
##                pcnt_hsp   percap_incm    med_rent      med_age
## amount      -0.03254065  0.1602285332  0.07274805  0.080085191
## population   0.18414820 -0.0308150357  0.15689252 -0.173767903
## pcnt_wht    -0.22500887  0.2590463011  0.12470890  0.268743448
## pcnt_blk     0.06509812 -0.3122234697 -0.19873158 -0.186164441
## pcnt_asn     0.07400635  0.4662921634  0.51243822 -0.223951022
## pcnt_hsp     1.00000000 -0.1271504849 -0.04696094 -0.223390263
## percap_incm -0.12715048  1.0000000000  0.74808626  0.334139211
## med_rent    -0.04696094  0.7480862610  1.00000000  0.168406483
## med_age     -0.22339026  0.3341392105  0.16840648  1.000000000
## time         0.00600853  0.0006534885  0.01281548 -0.003242782
##                      time
## amount      -0.1103922576
## population   0.0326138020
## pcnt_wht     0.0149298671
## pcnt_blk    -0.0194436000
## pcnt_asn     0.0134717088
## pcnt_hsp     0.0060085300
## percap_incm  0.0006534885
## med_rent     0.0128154786
## med_age     -0.0032427822
## time         1.0000000000

None of the numeric variables are strongly correlated with amount, although all show at least a weak correlation with it (absolute value greater than 0.03). Despite the increase in the number of contributions towards the election, the contribution amount is negatively correlated with time.

The factor variables gender and party were not included in the correlation analysis or pairs plot, so we should take a closer look at these.
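A correlation matrix like the one above, including a numeric `time` column derived from the date, can be sketched as follows. The data here are synthetic and the choice of `use = "complete.obs"` is an assumption about how missing demographic values were handled.

```r
# Synthetic data with a few missing demographic values (NOT the real data).
set.seed(42)
n  <- 200
df <- data.frame(
  amount      = rlnorm(n, meanlog = log(50), sdlog = 1),
  percap_incm = rnorm(n, mean = 32000, sd = 8000),
  med_age     = rnorm(n, mean = 39, sd = 5),
  date        = as.Date("2011-01-28") + sample(0:700, n, replace = TRUE)
)
df$percap_incm[c(3, 17)] <- NA

# Dates are stored as days since 1970-01-01, so as.numeric() gives a time axis.
df$time <- as.numeric(df$date)

# Pearson correlations over rows with no missing values in any column.
num_cols <- c("amount", "percap_incm", "med_age", "time")
cor_mat  <- cor(df[, num_cols], use = "complete.obs")
print(round(cor_mat, 3))
```

With `use = "pairwise.complete.obs"` instead, each pair of columns would use all rows available for that pair, which can give slightly different values.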

The mean contribution amount, as well as the interquartile range (IQR), is larger for males than for females.

The mean contribution amount, as well as the interquartile range (IQR), is larger for Republicans than for Democrats.
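Group comparisons of this kind can be checked numerically with `tapply()`. The values below are made up purely to show the mechanics; rows with an unknown gender are dropped automatically because `NA` is not a group.

```r
# Toy data (hypothetical amounts), including one contributor of unknown gender.
df <- data.frame(
  gender = c("female", "female", "female", "male", "male", "male", NA),
  amount = c(25, 50, 100, 25, 100, 500, 75)
)

# Mean and interquartile range of amount within each gender.
mean_by_gender <- tapply(df$amount, df$gender, mean)
iqr_by_gender  <- tapply(df$amount, df$gender, IQR)

print(mean_by_gender)
print(iqr_by_gender)
```

The same two calls with `party` in place of `gender` would give the corresponding comparison by party.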

The following are plots of variables of interest by amount with correlation statistics.

## [1] 0.1584399

## [1] -0.1104779

## [1] 0.08065799

## [1] 0.07274805

## [1] -0.05841065

## [1] 0.03966285

## [1] 0.03676803

## [1] -0.03602072

## [1] -0.03267223

Although some of these correlation values are nonzero, the relationships are difficult to see when plotted. I think that these relationships are very weak as far as being able to predict contribution amount.

Here we can see that contributions early on tend to be larger and to have a greater range. There is also much less data for 2011 than for 2012, so this may be a factor.

Neither month nor day seems to correlate much with amount.

In this choropleth map, the locations of the population centers are not as apparent. Average contribution amount doesn’t seem to correlate with city centers as much as total contributions do.

Here we can see the population centers.

Average contribution amount may more closely resemble per capita income rather than total population. Higher values surround the city centers with something of a buffer. Also, there are some areas away from the city centers with high average contribution amount. These could be affluent rural areas or areas with little data.

Total amount of contribution per zipcode does again seem to correlate with city proximity.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Relatively speaking, there were no strong relationships discovered in the numerical correlations. There were, however, weak correlations (absolute value > 0.03) between each of the variables and amount. Per capita income had the strongest correlation (0.16), followed by time (numerical date), median age, median rent, total population, percent asian, percent white, percent black, and percent hispanic with the weakest (-0.0325). Time, total population, percent black, and percent hispanic were all negatively correlated. Among the unordered factors of gender and party, Republican contributions were on average higher than Democratic ones, and male contributions were higher than female ones.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I did not expect that time and total population would be correlated with contribution amount. It appears that contributions are largest early on, which might make sense if donors want to support a candidate through a longer campaign. Total population seems a bit arbitrary, as zipcodes are not necessarily zoned for equal area.

What was the strongest relationship you found?

The strongest relationship among all the variables was that between percent white and percent black of a contributor’s location. These have a correlation of -0.97. Among the variables of interest, the strongest correlation was between per capita income and amount, as I had suspected.

Multivariate Plots Section

Contributions to the Democratic party came from a slight female majority whereas contributions to the Republican party came from an overwhelming male majority.

Although it is a relatively small proportion, males tend to make slightly more contributions to the Democratic party at higher amounts than females do. There appears to be no such exception for the Republican party (males make more contributions at all amounts). Also, at higher amounts, Republicans make more contributions than Democrats.

The following plots show relationships between the variables of interest and amount, by gender and by party. Each plot contains both a LOESS and a linear model (LM) smoother.

The positive correlation between contribution amount and per capita income seems to be much more pronounced with Republicans than Democrats.

Both males and Republicans seem to ‘rally’ behind their candidate leading up to election time with an increase in contribution amount as compared to their female or Democrat counterparts.

Although this demographic information does not necessarily reflect the contributor, both Republicans and males show a stronger positive correlation between median age of their locale and personal contribution amount than do their female or Democrat counterparts.

Median rent by party is similar to per capita income by party.

Population by gender or party does not give us much more insight.

The percent ethnicities seem to show erratic trends with LOESS.

Contribution amount per person seems to more closely resemble the map of average contribution amount rather than total contributions.

## 
## Calls:
## m1: lm(formula = I(log(amount)) ~ I(percap_incm), data = subset(model_df, 
##     amount < 2700))
## m2: lm(formula = I(log(amount)) ~ I(percap_incm) + gender, data = subset(model_df, 
##     amount < 2700))
## m3: lm(formula = I(log(amount)) ~ I(percap_incm) + party, data = subset(model_df, 
##     amount < 2700))
## 
## ===============================================================
##                                 m1          m2          m3     
## ---------------------------------------------------------------
## (Intercept)                   3.573***    3.347***    3.318*** 
##                              (0.011)     (0.011)     (0.010)   
## I(percap_incm)                0.000***    0.000***    0.000*** 
##                              (0.000)     (0.000)     (0.000)   
## gender: male/female                       0.438***             
##                                          (0.007)               
## party: republican/democrat                            1.074*** 
##                                                      (0.007)   
## ---------------------------------------------------------------
## R-squared                         0.028       0.054       0.177
## adj. R-squared                    0.028       0.054       0.177
## sigma                             1.335       1.317       1.229
## F                              4128.020    4108.751   15532.538
## p                                 0.000       0.000       0.000
## Log-likelihood              -247362.113 -245400.794 -235324.957
## Deviance                     258224.633  251323.923  218675.314
## AIC                          494730.226  490809.588  470657.915
## BIC                          494759.876  490849.120  470697.448
## N                            144815      144815      144815    
## ===============================================================

Our best model only accounts for about 18% of the variation in donation amount.
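For reference, the form of model m3 shown in the table can be reproduced on synthetic data as sketched below. The data-generating values are assumptions chosen only for illustration; the formula and the `amount < 2700` subset follow the call printed above. Because the response is log(amount), a coefficient b on a factor level corresponds to a multiplicative effect of exp(b); for the reported party coefficient of 1.074, that is exp(1.074), or roughly 2.9 times the amount, holding per capita income fixed.

```r
# Synthetic data (NOT the real contributions) with a built-in party effect.
set.seed(7)
n <- 1000
model_df <- data.frame(
  percap_incm = rnorm(n, mean = 32000, sd = 8000),
  party       = factor(sample(c("democrat", "republican"), n, replace = TRUE))
)
model_df$amount <- exp(3.3 + 1.0 * (model_df$party == "republican") +
                       rnorm(n, sd = 1.2))

# Same model form as m3: log(amount) on per capita income and party.
m3 <- lm(log(amount) ~ percap_incm + party,
         data = subset(model_df, amount < 2700))

r2           <- summary(m3)$r.squared
party_effect <- exp(coef(m3)[["partyrepublican"]])  # multiplicative effect

print(r2)
print(party_effect)
```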

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I looked at amount against all of the correlated features, adding gender and/or party as a third variable. In nearly all of the comparisons, gender and party proved to be significant in differentiating contribution amount. Specifically, contribution amount was higher for males than for females, and higher for Republicans than for Democrats. Also, because Republican donors are mostly male, male trends generally mirror Republican trends, as female trends mirror Democratic trends.

Were there any interesting or surprising interactions between features?

I knew that gender and party might be significant factors, but I did not know to what extent (these features could not be analyzed in the correlation table). I was somewhat surprised to see that differences by gender and party held across all other features with respect to contribution amount.

An interesting finding was that Democratic contribution amounts show little increase with increasing per capita income of the contributor’s zipcode, as compared to Republican contribution amounts. Also, contribution amount over time by party showed that, although both parties had larger contribution amounts early on, Republicans increased their contribution amounts leading up to the election whereas Democrats did not. The same is true for males versus females, but the relationship is less pronounced. This sort of last-minute increase in contribution amount reminds me of a type of rally behavior. Whether or not this is effective in carrying a candidate to victory is another question altogether, but I doubt it (especially since Romney lost Ohio in 2012).

Another interesting difference between males and females is that as the median age of the contributor’s zipcode increases, male contribution amount tends to increase whereas female contribution amount shows a slight decrease. The same idea applies to Republicans and Democrats: Republican amounts increase as the median age of the contributor’s zipcode increases, while Democratic amounts hold steady. If the median age of the contributor’s zipcode did in fact reflect the actual age of the contributor, we could hypothesize that females are less inclined to donate large amounts as they get older.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I experimented with several different models and found that one which modeled the log of amount with per capita income, gender, and party was the most effective in explaining variance in contribution amount. I tried to include time as a predictor because it had the second-strongest correlation with amount (-0.11); however, adding this feature only decreased the R-squared value. Adding all of the remaining correlated features likewise decreased the R-squared value.


Final Plots and Summary

Plot One

Description One

The first plot shows both how contribution amounts are distributed log-normally and how contributions increase leading up to an election.

Plot Two

Description Two

This plot grid shows that, in Ohio, most Democratic contributions were made by females and an overwhelming majority of Republican contributions were made by males. It also shows that, in general, Republican and male contribution amounts are higher than Democratic and female contribution amounts. An interesting rally phenomenon can be seen here for the Republican party as the election date approaches.

Plot Three

Description Three

This final plot shows that, despite total contribution amount being highly correlated with city center, average contribution amount is highest surrounding a city and also in some rural areas. This may shed light on why presidential candidates spend a significant amount of time campaigning in suburbs and seemingly rural areas.


Reflection

In my investigation of 2012 presidential campaign contributions for the state of Ohio, I chose to focus on finding the most significant features of a contributor’s information that could be used to predict the actual contribution amount. The most significant features proved to be per capita income of the area in which the contributor lives (which we can assume gives an idea of the contributor him or herself), the gender of the contributor, and the political party affiliation of the contributor (simplified to be either Democrat or Republican). Per capita income of the contributor’s zipcode has a positive correlation with the contributor’s contribution amount. Males have on average higher contribution amounts than females, as do Republicans versus Democrats. This would lead one to conclude that, given these correlated features, the highest contribution amount could belong to a male Republican who lives in an area with high per capita income. The lowest contribution amount might belong to a contributor who is a female Democrat and who lives in an area with a low per capita income.

Despite these findings, the model that was developed was only able to account for about 18% of the variation in contribution amount. It would have been great to have the actual income, age, and ethnicity of the contributor him/herself. I believe that these would have had a much higher correlation than the demographic information of the contributor’s zipcode. The demographic information was at best a rough approximation of the contributor.

With regard to the choropleth maps, it seemed apparent that total contributions and total contribution amount were highly correlated with city proximity. Average contribution appeared to be higher closer to cities, but with a buffer between the actual city center and high average contribution amounts.

There are several shortcomings of the data set. First, I question the validity of some of the information, as several contribution amounts had to be changed from negative to positive. Second, some zipcodes had far less data than others, which may have skewed the average contribution choropleth map. Another shortcoming was the inability to programmatically determine the gender of the contributor by first name for about 3% of the data. I had a difficult time with regex in R; were I more adept at this, that data might have been included.

If possible, further analysis could include distance, or some measure of proximity, to a city center. The choropleth maps that were generated attempted to show a spatial relationship between amount and cities. This was, however, at best an approximation without any concrete measurements to back up the claims and insights. To do this, an average latitude and longitude value could be calculated and added as a variable for each zipcode, and another variable could be added for distance to the closest city.
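One way to compute such a distance is the haversine (great-circle) formula, sketched below in base R. The function names, the city coordinates (approximate values), and the example zipcode centroid are all assumptions for illustration; a real analysis would need a proper source of zipcode centroids.

```r
# Great-circle distance in km between points given in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))  # pmin guards against rounding above 1
}

# Approximate city-center coordinates (assumed values for illustration).
cities <- data.frame(
  city = c("COLUMBUS", "CLEVELAND", "CINCINNATI"),
  lat  = c(39.96, 41.50, 39.10),
  lon  = c(-83.00, -81.69, -84.51),
  stringsAsFactors = FALSE
)

# Find the closest city and its distance for a single zipcode centroid.
nearest_city <- function(lat, lon, cities) {
  d <- haversine_km(lat, lon, cities$lat, cities$lon)
  i <- which.min(d)
  data.frame(city = cities$city[i], dist_km = d[i], stringsAsFactors = FALSE)
}

# Hypothetical zipcode centroid just north of Columbus.
nearest <- nearest_city(40.00, -83.00, cities)
print(nearest)
```

The resulting `dist_km` could then be added as a numeric predictor alongside per capita income and party.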